
CMSC320 Final Project -- Optimizing AirBnb

By Ian Costello, Domenic SanGiovanni, Ben Krochta

In recent years, AirBnb has revolutionized the lodging industry. No longer are consumers limited to expensive hotels; instead, they have access to a diverse set of options in location, pricing, and amenities. We explore which factors are important in the ranking of so-called "Superhosts," who may receive preferential bookings from customers. Further, we create a model to predict the average rating of a listing based on auxiliary information. Understanding these factors could help AirBnb hosts better understand how to optimize their listings.

Meanwhile, an entire industry has sprung up to meet the demand for short-term rentals. Some hosts use Airbnb to subsidize part-time living: they simply rent out their home during the summer or winter months, or while out of town for an extended period. Others rent out a spare unit in a large apartment to generate additional revenue. Conversely, some hosts own multiple properties for the sole purpose of leasing them on AirBnb. We explore whether machine learning algorithms can classify these types of hosts based on their rentals, which could help regulators understand this complicated new industry.

This page will act partly as an investigation and partly as a tutorial. We will walk through the various steps of the data science pipeline. There are hundreds of articles about machine learning on AirBnb data, but few walk you through the process from start to finish. This is what we aim to accomplish here.

  1. Data Collection and Preprocessing
  2. Exploratory Visualization and Analysis
  3. Hypothesis Testing and Machine Learning

Data Collection

AirBnb provides an extensive public listings dataset. In this project, we decided to focus on listings in New York City's five boroughs. For easy code reproduction, we include a permalink to the dataset, which we download directly through pandas. We show one representative listing from the dataset below.

In [1]:
import pandas as pd
import numpy as np
from plotnine import *

df = pd.read_csv('http://data.insideairbnb.com/united-states/ny/new-york-city/2020-04-08/data/listings.csv.gz')
df.iloc[3654:3655, :]
Out[1]:
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview ... instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
3654 2713887 https://www.airbnb.com/rooms/2713887 20200408162728 2020-04-09 Furnished Studio UES near the Met Furnished 1-bedroom five blocks from Central P... NaN Furnished 1-bedroom five blocks from Central P... none NaN ... f f strict_14_with_grace_period f f 1 1 0 0 0.06

1 rows × 106 columns

Data Preprocessing

Looking at the above dataframe, we see that it contains over a hundred columns, many of which contain irrelevant or duplicate information that is not directly suitable for data science. In the sections below, we identify a subset of useful columns and perform the necessary preprocessing to coerce these columns into the correct format.

Further, as seen above, many of these columns contain null values. Given the massive size of the AirBnb public dataset, we need a robust method to evaluate which features of the dataset have sufficient information to explore. Let's compute the percentage of entries in each feature column that are non-null.

A thorough guide on data preprocessing can be found here: https://towardsdatascience.com/data-cleaning-series-with-python-part-1-24bb603c82c8?gi=76b4899af990

In [2]:
features_to_evaluate = ['square_feet', 'transit', 'host_response_time', 'is_business_travel_ready']

def evaluateList(df, features_to_evaluate):
  total_entities = df.shape[0]
  print(f"Percentage of Valid Entities For Each Feature")

  for feature in features_to_evaluate:
    has_feature = list(df[feature].isnull())
    non_null_count = sum([1 for x in has_feature if not x])
    print(f"Total of {100 * round(non_null_count / total_entities,3)}% for {feature}")

evaluateList(df, features_to_evaluate)
Percentage of Valid Entities For Each Feature
Total of 0.8% for square_feet
Total of 65.4% for transit
Total of 58.9% for host_response_time
Total of 100.0% for is_business_travel_ready

So let's avoid using square feet; it seems we can get sufficient information from the other columns.
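
As a quick aside, pandas can compute these coverage percentages in one vectorized expression. A minimal sketch on a toy dataframe (the column values here are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'square_feet': [None, None, 300.0, None],
                    'transit': ['A', None, 'B', 'C']})

# Fraction of non-null entries per column, expressed as a percentage
coverage = toy.notnull().mean() * 100
print(coverage.to_dict())  # {'square_feet': 25.0, 'transit': 75.0}
```

This produces the same numbers as the loop above in a single line per column.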

Description of Features Used

Information about listing

  • Total beds (discrete numeric): total number of beds in unit
  • Is Location Exact (binary): Is the exact address provided?
  • Property Type (Categorical): Apartment, Guest Suite, Other
  • Strong Description (binary): whether the host provides a description of the space and transit information
  • Price per night
  • Availability of apartment during year

Information about Host

  • Host Response Time
  • Host Response Rate
  • Host Acceptance Rate
  • Host Total Listings Count
  • Host is superhost
In [3]:
import math
info_about_listing = ['beds', 'is_location_exact', 'property_type', 'price', 'availability_365', 'is_business_travel_ready', 'review_scores_rating','room_type']
host_info = ['host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_total_listings_count', 'host_is_superhost', 'host_id']

# Grab Subset of Columns
bnb_df = pd.DataFrame(data = df[info_about_listing + host_info])

# Convert df to correct types
bnb_df['host_is_superhost'] = bnb_df['host_is_superhost'].apply(lambda x: x == "t")
bnb_df['is_location_exact'] = bnb_df['is_location_exact'].apply(lambda x: x == "t")
bnb_df['price'] = pd.to_numeric(bnb_df['price'].str.replace('[$,]', '', regex=True))

# Subset the apartment types
apartment_types = ['Apartment', 'Guest suite', 'Townhouse', 'Hotel']
def simplifyApartmentType(prop_type):
  if prop_type in apartment_types:
    return prop_type
  else:
    return "Other"
bnb_df['property_type'] = bnb_df['property_type'].apply(simplifyApartmentType)

# Changes a percent string (e.g. "95%") to a float, keeps NaN
def change_to_percent(x):
  if isinstance(x, float) and math.isnan(x):
    return float('nan')
  else:
    return float(x[:-1])

# Converts response and acceptance to percent
bnb_df['host_response_rate'] = bnb_df['host_response_rate'].apply(change_to_percent)
bnb_df['host_acceptance_rate'] = bnb_df['host_acceptance_rate'].apply(change_to_percent)

bnb_df.head()
Out[3]:
beds is_location_exact property_type price availability_365 is_business_travel_ready review_scores_rating room_type host_response_time host_response_rate host_acceptance_rate host_total_listings_count host_is_superhost host_id
0 2.0 True Other 100 365 f 80.0 Private room a few days or more 22.0 50.0 0.0 False 2259
1 1.0 False Apartment 225 365 f 94.0 Entire home/apt within a few hours 93.0 36.0 6.0 False 2845
2 4.0 True Guest suite 89 233 f 89.0 Entire home/apt within an hour 89.0 95.0 1.0 False 4869
3 1.0 False Apartment 200 0 f 90.0 Entire home/apt NaN NaN 75.0 1.0 False 7322
4 1.0 False Apartment 60 365 f 90.0 Private room NaN NaN 67.0 1.0 False 7356
In [4]:
bnb_df.query("price > 10000")
Out[4]:
beds is_location_exact property_type price availability_365 is_business_travel_ready review_scores_rating room_type host_response_time host_response_rate host_acceptance_rate host_total_listings_count host_is_superhost host_id
38242 1.0 True Other 25000 0 f 92.0 Hotel room within an hour 100.0 97.0 5.0 False 262458398
38243 2.0 True Other 25000 0 f NaN Hotel room within an hour 100.0 97.0 5.0 False 262458398
38883 2.0 True Other 25000 0 f NaN Hotel room a few days or more 0.0 NaN 4.0 False 269288731

Can We Predict Rating?

Given the AirBnB features for a listing, are we able to predict the review rating?

To accomplish this, we first need to expand the bnb_df we built earlier to include a few more features.

Features Added:
  • Neighborhood
  • Room Type
  • How many people it accommodates
  • How many bathrooms
  • How many bedrooms
  • Type of beds
  • How many guests included
  • Is the AirBnB business travel ready
  • Is the host a superhost
Features Removed:
  • Is location exact
  • Host Response Time
In [5]:
rating_pred = bnb_df.copy(deep=True)
rating_pred[['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type','guests_included']] = df[['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'guests_included']]
rating_pred = rating_pred.drop(columns=['is_location_exact', 'host_response_time'])

Next, we need to process this data before modeling. We are going to use a DecisionTreeClassifier, which requires all variables to be numeric. Thus, we first convert booleans to 1 or 0.

In [6]:
# Convert 't'/'f' to 1/0 respectively
def change_char_to_int(x):
  if x == 't':
    return 1
  else:
    return 0
# Convert True/False to 1/0 respectively
def change_bool_to_int(x):
  if x:
    return 1
  else:
    return 0

rating_pred['is_business_travel_ready'] = rating_pred['is_business_travel_ready'].apply(change_char_to_int)
rating_pred['host_is_superhost'] = rating_pred['host_is_superhost'].apply(change_bool_to_int)
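
As an aside, the same 't'/'f' conversion can be written as a single vectorized map; a minimal sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['t', 'f', 't', 'f'])
# Map each flag character directly to its numeric encoding
converted = s.map({'t': 1, 'f': 0})
print(converted.tolist())  # [1, 0, 1, 0]
```

Either approach works; `map` avoids defining a helper function for simple lookups.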

Finally, we need to convert categorical variables into numeric ones. We do this with the pandas method get_dummies, which replaces a categorical column with one indicator column per possible value and sets a 1 in the column matching each row's value. For example, bed_type can be Real Bed, Futon, Couch, and so on; get_dummies adds new columns bed_type_Real Bed, bed_type_Futon, etc., and sets the proper column to 1.
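
A minimal sketch of get_dummies on a toy bed_type column shows the behavior (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'bed_type': ['Real Bed', 'Futon', 'Couch']})
# One indicator column per category, in sorted category order
dummies = pd.get_dummies(toy, columns=['bed_type'])
print(dummies.columns.tolist())
# ['bed_type_Couch', 'bed_type_Futon', 'bed_type_Real Bed']
```

Each row has exactly one of these indicator columns set to 1.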

In [7]:
# Encode the categorical columns as numeric
rating_pred = pd.get_dummies(rating_pred, columns=['property_type', 'room_type', 'neighbourhood', 'bed_type'])

# Remove rows containing NaN, as they generally correspond to premature AirBnBs
rating_pred = rating_pred.dropna()
rating_pred
Out[7]:
beds price availability_365 is_business_travel_ready review_scores_rating host_response_rate host_acceptance_rate host_total_listings_count host_is_superhost host_id ... neighbourhood_Williamsburg neighbourhood_Windsor Terrace neighbourhood_Woodhaven neighbourhood_Woodlawn neighbourhood_Woodside bed_type_Airbed bed_type_Couch bed_type_Futon bed_type_Pull-out Sofa bed_type_Real Bed
1 1.0 225 365 0 94.0 93.0 36.0 6.0 0 2845 ... 0 0 0 0 0 0 0 0 0 1
2 4.0 89 233 0 89.0 89.0 95.0 1.0 0 4869 ... 0 0 0 0 0 0 0 0 0 1
5 1.0 79 255 0 84.0 100.0 100.0 1.0 0 8967 ... 0 0 0 0 0 0 0 0 0 1
7 2.0 150 101 0 94.0 100.0 27.0 4.0 1 7549 ... 0 0 0 0 0 0 0 0 0 1
8 1.0 99 0 0 97.0 100.0 54.0 1.0 1 7989 ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
50072 8.0 65 365 0 100.0 100.0 100.0 0.0 0 340793389 ... 0 0 0 0 0 0 0 0 0 1
50123 2.0 93 86 0 100.0 100.0 100.0 0.0 1 20405437 ... 0 0 0 0 0 0 0 0 0 1
50178 1.0 34 54 0 20.0 100.0 50.0 0.0 0 16426264 ... 0 0 0 0 0 0 0 0 0 1
50184 1.0 45 21 0 80.0 100.0 100.0 0.0 0 26117696 ... 0 0 0 0 0 0 0 0 0 1
50250 1.0 75 49 0 100.0 94.0 96.0 2.0 0 197372149 ... 0 0 0 0 0 0 0 0 0 1

24309 rows × 225 columns

Part 2) Exploratory Data Analysis

Distribution of Review Scores

In [8]:
(ggplot(bnb_df, aes(x='review_scores_rating'))
         + geom_histogram(bins=20)
         + labs(title="Review Scores of Property Listings",
             x = "Review Score",
             y = "Count"))
/Users/iancostello/opt/anaconda3/envs/CMSC320/lib/python3.8/site-packages/plotnine/layer.py:360: PlotnineWarning: stat_bin : Removed 11681 rows containing non-finite values.
  data = self.stat.compute_layer(data, params, layout)
Out[8]:
<ggplot: (326536494)>

The above histogram shows the distribution of review scores for all listings in the NYC dataset. As you can see, most listings have a high overall rating, with a tail of lower scores. This dataset also includes removed listings, which may help reduce the survivorship bias that would otherwise push averages higher.

Listings by Unique Hosts

In [ ]:
# Average the listing count per unique host
unique_hosts = df[['host_id', 'host_total_listings_count']].groupby('host_id').agg('mean')
In [ ]:
(ggplot(unique_hosts.query("host_total_listings_count < 10"), aes(x='host_total_listings_count'))
         + geom_histogram(bins=10)
         + labs(title="Number of Listings For Hosts",
             x = "Number of Listings",
             y = "Number of Hosts"))
Out[ ]:
<ggplot: (8743484670875)>

The above histogram shows the total number of listings per unique host. From the histogram, a majority of hosts have just a single listing, but a non-negligible number of hosts own significantly more. Can we identify these differences in a systematic way?

Relating Price to Ranking

In [9]:
(ggplot(bnb_df[bnb_df['price'] < 1000].replace({'host_is_superhost': {0:'Not Superhost',1:'Superhost'}}), aes(x='review_scores_rating', y='price'))
         + geom_point()
         + facet_grid("property_type~host_is_superhost")
         + geom_smooth(method='lm', color="red")
         + labs(title="Rating Vs Price Conditioned on Host Information",
             x = "Review Scores",
             y = "Price"))
/Users/iancostello/opt/anaconda3/envs/CMSC320/lib/python3.8/site-packages/plotnine/layer.py:452: PlotnineWarning: geom_point : Removed 11464 rows containing missing values.
  self.data = self.geom.handle_na(self.data)
Out[9]:
<ggplot: (326531720)>

We can clearly see that Superhosts tend to have higher ratings (not surprising), but is there a deeper relationship between price, ratings, and the Superhost designation that we can uncover?

Visualizing Listings

Here we show the distribution of Airbnbs in New York City using a heatmap with ipyleaflet. It appears that most Airbnbs are in Manhattan and Brooklyn, while the other boroughs have some, but at a lower density.

In [ ]:
from ipyleaflet import *

map = Map(center=(40.71, -73.9), zoom=11,scroll_wheel_zoom=True)
map.layout.width = '100%'
map.layout.height = '900px'
heatmap = Heatmap(locations=[[row['latitude'],row['longitude']] for index,row in df.iterrows()], radius=10)
map.add_layer(heatmap)
map

Part 3) Hypothesis Testing and Machine Learning

We will seek to answer the following questions. Can we accurately predict whether a user is a Superhost, and what are the most important factors? What are the unique types of hosts? Can we accurately predict the rating of a listing?

Part 1) Can we predict if a user is a Superhost?

Using our preprocessed dataframe, we will train multiple classification models, dropping one of our feature columns for each model. This will help us understand which features are important when considering how AirBnb designates Superhosts. We will rank the relative performance of these models using the area under the curve (AUC) of the receiver operating characteristic (ROC). Essentially, we estimate the true positive vs. false positive rate of each of our models at each cross-validation step. A higher area under the curve indicates a better-predicting model. You can read more about AUC-ROC curves on the popular Towards Data Science blog.
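
To make AUC concrete: it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A small self-contained sketch with invented labels and scores:

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC = probability a random positive outranks a random negative."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Count pairwise wins for positives; ties count half
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(auc_score(y_true, y_score))  # 0.75
```

In the pipeline below we compute the same quantity with sklearn's roc_curve and auc helpers; this hand-rolled version is only to show what the number means.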

In [ ]:
unique_hosts = bnb_df.groupby('host_id').agg('mean')
unique_hosts.head()
Out[ ]:
beds is_location_exact price availability_365 review_scores_rating host_response_rate host_acceptance_rate host_total_listings_count host_is_superhost
host_id
2259 2.000000 1.0 100.000000 365.000000 80.000000 22.0 50.0 0.0 0.0
2438 2.000000 1.0 95.000000 21.000000 NaN NaN 100.0 0.0 0.0
2571 3.000000 1.0 182.000000 269.000000 98.000000 100.0 60.0 1.0 1.0
2782 1.000000 1.0 105.000000 156.000000 94.500000 80.0 56.0 2.0 0.0
2787 1.333333 0.0 84.166667 197.333333 95.333333 100.0 92.0 6.0 1.0
In [ ]:
# Data Preprocessing
X = unique_hosts.drop('host_is_superhost', axis=1)
y = unique_hosts['host_is_superhost']

X = X.fillna(X.mean())
y = y.fillna(0)

ROC Curve Estimation

We define a function that takes an input X and y, performs 5-fold cross-validation, trains the model on each fold, and returns the AUC-ROC estimates.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics
import sklearn.model_selection

model = KNeighborsClassifier(n_neighbors=2)

# Calculate cross-validated ROC curves and per-fold AUC values
def calculateROC(X,y):
  curve_df = None
  aucs = []
  mean_fpr = np.linspace(0, 1, 100)

  cv_obj = sklearn.model_selection.StratifiedKFold(n_splits=5)

  # Loop over each train/test validation split
  for i, (train, test) in enumerate(cv_obj.split(X, y)):
      # Train the model and predict on the test set
      model.fit(X.iloc[train], y.iloc[train])
      scores = model.predict_proba(X.iloc[test])[:,1]
      
      # Calculate false positives vs true positives
      fpr, tpr, _ = sklearn.metrics.roc_curve(y.iloc[test],scores)
      
      # Calculate linear interpolation
      interp_tpr = np.interp(mean_fpr, fpr, tpr)
      interp_tpr[0] = 0.0
      
      # Add to or set to global frame
      tmp = pd.DataFrame({'fold':i, 'fpr': mean_fpr, 'tpr': interp_tpr})
      curve_df = tmp if curve_df is None else pd.concat([curve_df, tmp])
      
      # Add AUC value for curve
      aucs.append(sklearn.metrics.auc(fpr, tpr))
      
  # Refactor and return
  curve_df = curve_df.groupby('fpr').agg({'tpr': 'mean'}).reset_index()
  curve_df.iloc[-1,1] = 1.0
  auc_df = pd.DataFrame({'fold': np.arange(len(aucs)), 'auc': aucs})
  return curve_df, auc_df 

Evaluate Performance

We will now evaluate the relative performance of each model trained with one feature removed.

In [ ]:
# Set of data for each run
curve_df = pd.DataFrame()
auc_df = pd.DataFrame()

# For each feature in our dataset
for feature in X.columns:
  # Train the ROC curve with the feature dropped
  X1_w_o_feature = X.drop(feature, axis=1)
  curve_df_st, auc_df_st = calculateROC(X1_w_o_feature, y)

  # Add it to our global dataframe
  curve_df_st['wo_feature'] = feature
  auc_df_st['wo_feature'] = feature
  curve_df = pd.concat([curve_df, curve_df_st])
  auc_df = pd.concat([auc_df, auc_df_st])
In [ ]:
(ggplot(auc_df, aes(x='wo_feature', y='auc')) + 
     geom_jitter(position=position_jitter(0.1)) +
     coord_flip() +
     labs(title = "AUC Comparison",
          x="Dataset Used",
          y="Area under ROC curve"))
Out[ ]:
<ggplot: (-9223363293401647132)>
In [ ]:
# Estimate if there is a statistically significant difference 
import statsmodels.formula.api as smf
lm_fit = smf.ols('auc~wo_feature', data=auc_df).fit()
lm_fit.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: auc R-squared: 0.772
Model: OLS Adj. R-squared: 0.722
Method: Least Squares F-statistic: 15.50
Date: Mon, 18 May 2020 Prob (F-statistic): 1.14e-08
Time: 00:21:13 Log-Likelihood: 104.21
No. Observations: 40 AIC: -192.4
Df Residuals: 32 BIC: -178.9
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.8030 0.009 89.834 0.000 0.785 0.821
wo_feature[T.beds] -0.0338 0.013 -2.673 0.012 -0.060 -0.008
wo_feature[T.host_acceptance_rate] -0.0510 0.013 -4.037 0.000 -0.077 -0.025
wo_feature[T.host_response_rate] -0.0467 0.013 -3.695 0.001 -0.072 -0.021
wo_feature[T.host_total_listings_count] -0.0363 0.013 -2.875 0.007 -0.062 -0.011
wo_feature[T.is_location_exact] -0.0353 0.013 -2.792 0.009 -0.061 -0.010
wo_feature[T.price] 0.0149 0.013 1.182 0.246 -0.011 0.041
wo_feature[T.review_scores_rating] -0.1026 0.013 -8.116 0.000 -0.128 -0.077
Omnibus: 5.816 Durbin-Watson: 1.572
Prob(Omnibus): 0.055 Jarque-Bera (JB): 4.610
Skew: -0.797 Prob(JB): 0.0998
Kurtosis: 3.475 Cond. No. 8.89


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]:
mean_curve_df = curve_df.groupby(['wo_feature','fpr']).agg({'tpr': 'mean'}).reset_index()
(ggplot(mean_curve_df, aes(x='fpr', y='tpr', color='wo_feature')) +
    geom_line() +
    labs(title = "ROC curves",
         x = "False positive rate",
         y = "True positive rate"))
Out[ ]:
<ggplot: (-9223363293401744205)>

A lower AUC when a feature is removed implies that the feature is important. From the above plots and regression output, we can see that review scores are the single most important factor in determining whether a user is a Superhost. Further, neither price nor availability is an important factor. Interestingly, AirBnb appears to remain relatively fair about who is deemed a Superhost, as lower-cost, part-time hosts may still earn the designation.

Are there unique types of hosts?

We will use an unsupervised clustering algorithm to find which groups of hosts are most closely related. Unsupervised algorithms find relationships and groupings within data without explicit labels. You can read more about the topic here: https://towardsdatascience.com/introduction-to-unsupervised-learning-8f1b189e9050?gi=50af3b45ff0c.
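
Before clustering the hosts, here is a minimal k-means sketch on a single invented feature, showing how points split into groups by proximity:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy availability values (in days) with two obvious groups
X = np.array([[10.0], [15.0], [20.0], [350.0], [355.0], [360.0]])

# Partition into 2 clusters; each point gets the label of its nearest centroid
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

The three low-availability points share one label and the three high-availability points share the other, which is exactly the kind of full-time vs. part-time split we look for below.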

In [ ]:
import scipy

# Use the mode for the categorical column and the mean for numeric columns
unique_hosts_mean = bnb_df.groupby('host_id').agg('mean')
unique_hosts_mode = bnb_df[['host_id','room_type']].groupby('host_id').agg(lambda x: scipy.stats.mode(x)[0][0])

# Join the numeric and categorical aggregates
unique_hosts = unique_hosts_mean.join(unique_hosts_mode)
unique_hosts.replace({'room_type': {'Private room':0,'Entire home/apt':1,'Shared room':2,'Hotel room':3}},inplace=True)
unique_hosts
Out[ ]:
beds is_location_exact price availability_365 review_scores_rating host_response_rate host_acceptance_rate host_total_listings_count host_is_superhost room_type
host_id
2259 2.000000 1.0 100.000000 365.000000 80.000000 22.0 50.0 0.0 0.0 0
2438 2.000000 1.0 95.000000 21.000000 NaN NaN 100.0 0.0 0.0 1
2571 3.000000 1.0 182.000000 269.000000 98.000000 100.0 60.0 1.0 1.0 1
2782 1.000000 1.0 105.000000 156.000000 94.500000 80.0 56.0 2.0 0.0 0
2787 1.333333 0.0 84.166667 197.333333 95.333333 100.0 92.0 6.0 1.0 0
... ... ... ... ... ... ... ... ... ... ...
343198023 1.000000 1.0 79.000000 9.000000 NaN NaN 100.0 0.0 0.0 1
343380074 3.000000 1.0 112.000000 335.000000 NaN NaN NaN 0.0 0.0 0
343381111 3.000000 1.0 60.000000 0.000000 NaN NaN NaN 0.0 0.0 1
343382932 3.000000 1.0 70.000000 224.000000 NaN NaN NaN 0.0 0.0 1
343403269 1.000000 1.0 53.000000 175.000000 NaN NaN NaN 1.0 0.0 0

37758 rows × 10 columns

In [ ]:
from sklearn.cluster import KMeans
from sklearn import preprocessing

# We need to fill in empty values
X = unique_hosts
X = X.fillna(X.mean())

# Then scale all columns so that they are equally weighted
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)

# Perform Kmeans clustering with 3 groups
cluster_clf = KMeans(n_clusters=3, random_state=42).fit(X)

# Set distinct colors for each cluster
colors = {0: "red", 1: "green", 2: "blue"}
def cluster_to_color(num):
  return colors[num]

# Add clusters back to original dataframe
unique_hosts_mod = unique_hosts.copy()
unique_hosts_mod['cluster'] = cluster_clf.labels_
unique_hosts_mod['cluster'] = unique_hosts_mod['cluster'].apply(cluster_to_color)
unique_hosts_mod.head()
Out[ ]:
beds is_location_exact price availability_365 review_scores_rating host_response_rate host_acceptance_rate host_total_listings_count host_is_superhost room_type cluster
host_id
2259 2.000000 1.0 100.000000 365.000000 80.000000 22.0 50.0 0.0 0.0 0 red
2438 2.000000 1.0 95.000000 21.000000 NaN NaN 100.0 0.0 0.0 1 green
2571 3.000000 1.0 182.000000 269.000000 98.000000 100.0 60.0 1.0 1.0 1 blue
2782 1.000000 1.0 105.000000 156.000000 94.500000 80.0 56.0 2.0 0.0 0 red
2787 1.333333 0.0 84.166667 197.333333 95.333333 100.0 92.0 6.0 1.0 0 blue

Visualizing Clusters

In [ ]:
(ggplot(unique_hosts_mod.query("host_total_listings_count < 50"), aes(x='availability_365', y='host_total_listings_count', color="cluster"))
         + geom_point()
         + labs(title="Host Clusters: Availability vs Total Listings",
             x = "Year-Round Availability",
             y = "Total Listings Of Host"))
Out[ ]:
<ggplot: (-9223363293401651361)>

The above plot shows year-round availability vs. the total listings of a host, colored by the determined clusters. There is a clear cluster split along availability, showing that hosts segment based on whether they lease the apartment full-time or only part of the time. Interestingly, there seems to be a third cluster: let's explore what it might be.

Investigating Third Cluster

We investigated various metrics to determine what the third grouping could be.

In [ ]:
(ggplot(unique_hosts_mod.query("price <10000").replace({'room_type': {0:'Private room',1:'Entire home/apt',2:'Shared room',3:'Hotel room'}}).replace({'host_is_superhost': {0:'Not Superhost',1:'Superhost'}}),
        aes(x='availability_365', y='price', color="cluster"))
         + geom_point()
         + facet_grid('host_is_superhost~room_type')
         + labs(title="Availability Vs Price Conditioned on Host Information",
             x = "Year-Round Availability",
             y = "Price per Night"))
Out[ ]:
<ggplot: (-9223363293401688218)>

The above plot shows the same availability versus price, but faceted by room type and Superhost status. As we can see, there are no significant differences across room types between hosts, but there is a significant difference between Superhosts and regular hosts. This may imply that the Superhost designation is actually meaningful and not just an arbitrary ranking.

Can we predict the rating?

We will use a decision tree, which is essentially a series of learned questions arranged in a tree structure, ending at a leaf that gives the final prediction. You can read further about this technique here: https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14.
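
A minimal sketch of a decision tree on an invented one-feature dataset; with perfectly separable data, the tree only needs to learn a single split:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label is 1 exactly when the single feature exceeds 5
X = [[1], [2], [3], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

# The fitted tree learns one threshold question ("is the feature > ~5?")
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[2], [9]]).tolist())  # [0, 1]
```

Our real model below works the same way, just with hundreds of feature columns and many more splits.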

An important part of the app itself is being able to tell how high an AirBnB's rating is. However, this information is sometimes unavailable, and that is when a classifier that predicts the rating of an AirBnB becomes useful. Since the ratings are integer scores, we can treat this as a classification problem and use a Decision Tree Classifier.

In [ ]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # used for the classifier
from sklearn.model_selection import train_test_split # used to make a test and train portion of data
from sklearn import metrics # metrics for accuracy

# Features: every column except the rating we are predicting
X = rating_pred.drop(columns=['review_scores_rating'])
# Ratings column
y = rating_pred['review_scores_rating']

# Split into 75% training and 25% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Create classifier
classifier = DecisionTreeClassifier()

# Train the classifier
classifier = classifier.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = classifier.predict(X_test)

print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
Mean Absolute Error: 5.486344192168477

Analysis

As we can see, the classifier has a mean absolute error between 5 and 6, meaning that the average difference between the predicted and actual rating is between 5 and 6 points. Since ratings range from 0 to 100, being within about 5 points is fairly precise. Using most of the information provided, we can decently predict the review score of a listing, which could help potential customers evaluate relatively new AirBnBs.
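
For reference, mean absolute error is simple to compute directly; a sketch with invented actual and predicted ratings:

```python
import numpy as np

y_test = np.array([90.0, 80.0, 100.0, 95.0])  # actual ratings (invented)
y_pred = np.array([85.0, 86.0, 98.0, 95.0])   # predicted ratings (invented)

# Mean of the absolute differences between prediction and truth
mae = np.abs(y_test - y_pred).mean()
print(mae)  # 3.25
```

This is exactly what sklearn's metrics.mean_absolute_error computes above.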

There are some things that could be improved in this model, though. The most important is feature selection: we could incorporate the features that were not added to the classifier. In addition, we dropped rows that included NaN because missing information is a problem for the classifier, but we could have imputed values instead of discarding that information. Lastly, predicting the rating could be somewhat biased, as there are confounding variables wrapped up in the rating, for example, how genuine the owner is or whether the house looks decent inside. These are not captured in this data but could play a large role in these predictions.